Show the code
spain = sf::read_sf("data/spain.gpkg")
covariates = terra::rast("data/predictors.tif")
temperature = sf::read_sf("data/temp_train.gpkg")
temperature = terra::extract(covariates, temperature, bind = TRUE) |>
sf::st_as_sf()spain = sf::read_sf("data/spain.gpkg")
covariates = terra::rast("data/predictors.tif")
temperature = sf::read_sf("data/temp_train.gpkg")
temperature = terra::extract(covariates, temperature, bind = TRUE) |>
sf::st_as_sf()The RandomForestsGLS (https://doi.org/10.21105/joss.03780) package implements the Generalised Least Square (GLS) based Random Forest (RF-GLS) algorithm.1 This approach is designed for spatial data modeling as it accounts for spatial dependencies in the data by:
The package provides four functions:
RFGLS_estimate_spatial() for estimation in spatial dataRFGLS_predict() for prediction of the mean functionRFGLS_predict_spatial() for prediction of the spatial responseRFGLS_estimate_timeseries() for estimation in time series data (not discussed here)The package has rather unintuitive syntax and requires the data to be in a specific format. We need to provide the coordinates of the data (a matrix), the response variable (a vector), and the covariates (a matrix). In the example below, I limited the covariate matrix to the variables that are not spatial proxies.
library(RandomForestsGLS)
coords = sf::st_coordinates(temperature)
temp_response = temperature$temp
temperature_df = sf::st_drop_geometry(temperature)
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
covariate_matrix = as.matrix(temperature_df[, covariate_names])For the example, we also split the data into training and testing sets based on created random indices.
set.seed(2025-01-30)
train_idx = sample(1:nrow(coords), floor(nrow(coords) * 0.7))The RFGLS_estimate_spatial() function is used to fit the RF-GLS model. Here, we customize the number of trees to 100, but the function has many other parameters that can be adjusted.
estimation_result = RFGLS_estimate_spatial(
coords = coords[train_idx, ],
y = temp_response[train_idx],
X = covariate_matrix[train_idx, ],
ntree = 100
)
str(estimation_result)List of 7
$ P_matrix : int [1:136, 1:100] 18 78 88 123 31 80 49 67 65 98 ...
$ predicted_matrix: num [1:136, 1:100] 17 17 13 14.9 13 ...
$ predicted : num [1:136] 16 15.4 10.9 13.8 15 ...
$ X : num [1:136, 1:15] 727 20.7 0 203.7 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:136] "88" "106" "24" "11" ...
.. ..$ : chr [1:15] "popdens" "coast" "dem" "imd" ...
$ y : num [1:136] 16.13 15.21 7.53 13.61 15.59 ...
$ coords : num [1:136, 1:2] 495960 523533 273723 618599 616251 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "X" "Y"
$ RFGLS_object :List of 6
..$ ldaughter : int [1:273, 1:100] 2 4 6 8 0 10 12 0 14 0 ...
..$ rdaughter : int [1:273, 1:100] 3 5 7 9 0 11 13 0 15 0 ...
..$ nodestatus: int [1:273, 1:100] -3 -3 -3 -3 -1 -3 -3 -1 -3 -1 ...
..$ upper : num [1:273, 1:100] 27.029 0.216 76.038 0.287 0 ...
..$ avnode : num [1:273, 1:100] 0 0 0 0 15.3 ...
..$ mbest : int [1:273, 1:100] 14 10 2 8 0 2 8 0 2 0 ...
The result is a list with seven elements: a matrix of zero-indexed resamples, a matrix of predictions (ntest x ntree), a vector of predicted values, the covariate matrix, the response variable, the coordinates matrix, and the RF-GLS object.
Now, we can use the fitted model to predict the mean function (RFGLS_predict()) or the spatial response (RFGLS_predict_spatial()). The difference (as far as I understand) is that the former returns the mean prediction, while the latter uses the spatial coordinates in addition to the covariates to predict the spatial response.
The first function returns a list with two elements: a matrix of predictions (ntest x ntree) and a vector of predicted values, while the second function returns a list with just one element: a vector of predicted values. Just note that the predictions by the RFGLS_predict() are named "predicted" and the predictions by the RFGLS_predict_spatial() are named "prediction".
prediction_result = RFGLS_predict(
RFGLS_out = estimation_result,
Xtest = covariate_matrix[-train_idx, ]
)
prediction_result_spatial = RFGLS_predict_spatial(
RFGLS_out = estimation_result,
coords.0 = coords[-train_idx, ],
Xtest = covariate_matrix[-train_idx, ]
)
plot(prediction_result$predicted, prediction_result_spatial$prediction)The final results of these two approaches are v. similar, but not identical.
Now, let’s predict the models’ results on the whole dataset.
covariate_coords_r = terra::crds(covariates)
covariate_matrix_r = as.matrix(covariates)
covariate_matrix_r = covariate_matrix_r[, covariate_names]
pred_s = RFGLS_predict_spatial(
RFGLS_out = estimation_result,
coords.0 = covariate_coords_r,
Xtest = covariate_matrix_r
)
pred_r = terra::setValues(covariates[[1]], pred_s$prediction)
names(pred_r) = "prediction"
terra::plot(pred_r)The spatialRF (https://blasbenito.github.io/spatialRF/) package’s aim is to provide a minimal code interface to fit spatial regression models with Random Forest. The internal calculations are based on three general methods to generate spatial predictors from the distance matrix of the data points: Distance matrix columns as explanatory variables (Hengl et al. 2018), Moran’s Eigenvector Maps (Dray, Legendre, and Peres-Neto 2006) and PCAs. The ranger package is used here internally to fit the Random Forest model.
This package also requires the data to be in a specific format. We need to provide the data as a data frame with the dependent variable, including spatial coordinates, and the distance matrix: a matrix with the distances among the records in the data frame.
library(spatialRF)
library(sf)
coordinates = st_coordinates(temperature)
colnames(coordinates) = c("x", "y")
coordinates = as.data.frame(coordinates)
temperature_df = st_drop_geometry(temperature)
temperature_df$x = coordinates[, 1]
temperature_df$y = coordinates[, 2]
distance_matrix = as.matrix(dist(temperature_df[2:(ncol(temperature_df) - 9)]))We also need to define the dependent variable and the predictor variables.
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 9)]Finally, we can fit the models using one of the methods provided by the package. The package has 10 methods implemented, nine of which are based on the three components:2
"hengl", "mem", or "pca")"moran" or "effect")"sequential" or "recursive")The main function of this package is rf_spatial(), which fits the Random Forest model with spatial predictors. Here, an example using Moran’s Eigenvector Maps method to generate spatial predictors, Moran’s I to rank them, and sequential selection of the predictors is shown.
rf_spatial_moran = rf_spatial(
data = temperature_df,
dependent.variable.name = response_name,
predictor.variable.names = covariate_names,
distance.matrix = distance_matrix,
distance.thresholds = 0,
method = "mem.moran.sequential",
n.cores = 1
)
rf_spatial_moranModel type
- Fitted with: ranger()
- Response variable: temp
Random forest parameters
- Type: Regression
- Number of trees: 500
- Sample size: 195
- Number of predictors: 15
- Mtry: 3
- Minimum node size: 5
Model performance
- R squared (oob): 0.887361
- R squared (cor(obs, pred)^2): 0.9844978
- Pseudo R squared (cor(obs, pred)):0.9922186
- RMSE (oob): 0.9476691
- RMSE: 0.4131
- Normalized RMSE: 0.103203
Model residuals
- Stats:
┌───────┬────────┬────────┬──────┬────────┬──────┐
│ Min. │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
├───────┼────────┼────────┼──────┼────────┼──────┤
│ -1.87 │ -0.24 │ 0.02 │ 0.00 │ 0.26 │ 1.53 │
└───────┴────────┴────────┴──────┴────────┴──────┘
- Normality:
- Shapiro-Wilks W: 0.971
- p-value : 4e-04
- Interpretation : Residuals are not normal
- Spatial autocorrelation:
┌──────────┬───────────┬─────────┬──────────────────┐
│ Distance │ Moran's I │ P value │ Interpretation │
├──────────┼───────────┼─────────┼──────────────────┤
│ 0.0 │ 0.005 │ 0.258 │ No spatial │
│ │ │ │ correlation │
└──────────┴───────────┴─────────┴──────────────────┘
Variable importance:
┌──────────────┬────────────┐
│ Variable │ Importance │
├──────────────┼────────────┤
│ lst_night │ 1.966 │
├──────────────┼────────────┤
│ lst_day │ 1.342 │
├──────────────┼────────────┤
│ dem │ 1.245 │
├──────────────┼────────────┤
│ CAMSpm25 │ 0.936 │
├──────────────┼────────────┤
│ coast │ 0.745 │
├──────────────┼────────────┤
│ ndvi │ 0.683 │
├──────────────┼────────────┤
│ imd │ 0.420 │
├──────────────┼────────────┤
│ ntl │ 0.401 │
├──────────────┼────────────┤
│ industry │ 0.291 │
├──────────────┼────────────┤
│ popdens │ 0.270 │
├──────────────┼────────────┤
│ urban │ 0.218 │
├──────────────┼────────────┤
│ natural │ 0.212 │
├──────────────┼────────────┤
│ agriculture │ 0.196 │
├──────────────┼────────────┤
│ otherroads │ 0.181 │
├──────────────┼────────────┤
│ primaryroads │ 0.095 │
└──────────────┴────────────┘
The rf_spatial() returns a ranger model with several new slows, most importantly residuals that contain information about the residuals, and spatial that contains information about the selected spatial predictors and the method used to select them. Printing the model object provides a summary of the model, including its parameters, model performance, information on model residuals, and variable importance.
The spatialRF package also provides a set of additional functions. It includes a function for reducing multicollinearity in the predictors and removing redundant spatial predictors (filter_spatial_predictors()); or finding promising variable interactions (the_feature_engineer()):
interactions = the_feature_engineer(
data = temperature_df,
dependent.variable.name = response_name,
predictor.variable.names = covariate_names,
xy = coordinates,
importance.threshold = 0.50, #uses 50% best predictors
cor.threshold = 0.60, #max corr between interactions and predictors
seed = 2025-01-30,
repetitions = 100,
verbose = TRUE
) ┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Interaction │ Importance (% of │ R-squared │ Max cor with │
│ │ max) │ improvement │ predictors │
├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
│ dem..x..CAMSpm25 │ 33.6 │ 0.020 │ 0.52 │
├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
│ dem..x..imd │ 22.1 │ 0.014 │ 0.54 │
├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
│ imd..pca..ntl │ 13.6 │ 0.013 │ 0.454 │
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘
The rf_evaluate() function allows the evaluation of the model using spatial cross-validation.
rf_eval = rf_evaluate(
model = rf_spatial_moran,
xy = coordinates,
repetitions = 30,
training.fraction = 0.75,
metrics = "rmse",
seed = 2025-01-30,
verbose = TRUE
)
Spatial evaluation
- Training fraction: 0.75
- Spatial folds: 30
Metric Median MAD Minimum Maximum
rmse 1.428 0.221 0.815 1.968
rf_evalModel type
- Fitted with: ranger()
- Response variable: temp
Random forest parameters
- Type: Regression
- Number of trees: 500
- Sample size: 195
- Number of predictors: 15
- Mtry: 3
- Minimum node size: 5
Model performance
- R squared (oob): 0.887361
- R squared (cor(obs, pred)^2): 0.9844978
- Pseudo R squared (cor(obs, pred)):0.9922186
- RMSE (oob): 0.9476691
- RMSE: 0.4131
- Normalized RMSE: 0.103203
Model residuals
- Stats:
┌───────┬────────┬────────┬──────┬────────┬──────┐
│ Min. │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
├───────┼────────┼────────┼──────┼────────┼──────┤
│ -1.87 │ -0.24 │ 0.02 │ 0.00 │ 0.26 │ 1.53 │
└───────┴────────┴────────┴──────┴────────┴──────┘
- Normality:
- Shapiro-Wilks W: 0.971
- p-value : 4e-04
- Interpretation : Residuals are not normal
- Spatial autocorrelation:
┌──────────┬───────────┬─────────┬──────────────────┐
│ Distance │ Moran's I │ P value │ Interpretation │
├──────────┼───────────┼─────────┼──────────────────┤
│ 0.0 │ 0.005 │ 0.258 │ No spatial │
│ │ │ │ correlation │
└──────────┴───────────┴─────────┴──────────────────┘
Variable importance:
┌──────────────┬────────────┐
│ Variable │ Importance │
├──────────────┼────────────┤
│ lst_night │ 1.966 │
├──────────────┼────────────┤
│ lst_day │ 1.342 │
├──────────────┼────────────┤
│ dem │ 1.245 │
├──────────────┼────────────┤
│ CAMSpm25 │ 0.936 │
├──────────────┼────────────┤
│ coast │ 0.745 │
├──────────────┼────────────┤
│ ndvi │ 0.683 │
├──────────────┼────────────┤
│ imd │ 0.420 │
├──────────────┼────────────┤
│ ntl │ 0.401 │
├──────────────┼────────────┤
│ industry │ 0.291 │
├──────────────┼────────────┤
│ popdens │ 0.270 │
├──────────────┼────────────┤
│ urban │ 0.218 │
├──────────────┼────────────┤
│ natural │ 0.212 │
├──────────────┼────────────┤
│ agriculture │ 0.196 │
├──────────────┼────────────┤
│ otherroads │ 0.181 │
├──────────────┼────────────┤
│ primaryroads │ 0.095 │
└──────────────┴────────────┘
Spatial evaluation
- Training fraction: 0.75
- Spatial folds: 30
Metric Median MAD Minimum Maximum
rmse 1.428 0.221 0.815 1.968
The rf_importance() function allows for visualizing the variable importance of the model.
rf_imp = rf_importance(
rf_spatial_moran,
xy = coordinates
)rf_impModel type
- Fitted with: ranger()
- Response variable: temp
Random forest parameters
- Type: Regression
- Number of trees: 500
- Sample size: 195
- Number of predictors: 15
- Mtry: 3
- Minimum node size: 5
Model performance
- R squared (oob): 0.887361
- R squared (cor(obs, pred)^2): 0.9844978
- Pseudo R squared (cor(obs, pred)):0.9922186
- RMSE (oob): 0.9476691
- RMSE: 0.4131
- Normalized RMSE: 0.103203
Model residuals
- Stats:
┌───────┬────────┬────────┬──────┬────────┬──────┐
│ Min. │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
├───────┼────────┼────────┼──────┼────────┼──────┤
│ -1.87 │ -0.24 │ 0.02 │ 0.00 │ 0.26 │ 1.53 │
└───────┴────────┴────────┴──────┴────────┴──────┘
- Normality:
- Shapiro-Wilks W: 0.971
- p-value : 4e-04
- Interpretation : Residuals are not normal
- Spatial autocorrelation:
┌──────────┬───────────┬─────────┬──────────────────┐
│ Distance │ Moran's I │ P value │ Interpretation │
├──────────┼───────────┼─────────┼──────────────────┤
│ 0.0 │ 0.005 │ 0.258 │ No spatial │
│ │ │ │ correlation │
└──────────┴───────────┴─────────┴──────────────────┘
Variable importance:
┌────────────┬────────────┬─────────┬───────┬──────┬─────┐
│ Variable │ Importance │ │ │ │ │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ lst_night │ 1.966 │ 0.13 │ 0.07 │ 15.5 │ 8.4 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ lst_day │ 1.342 │ 0.0695 │ 0.035 │ 8.3 │ 4.2 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ dem │ 1.245 │ 0.045 │ 0.019 │ 5.4 │ 2.3 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ CAMSpm25 │ 0.936 │ -0.014 │ 0.021 │ -1.7 │ 2.5 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ coast │ 0.745 │ -0.01 │ 0.01 │ -1.2 │ 1.1 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ ndvi │ 0.683 │ 0.0055 │ 0.013 │ 0.7 │ 1.5 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ imd │ 0.420 │ -0.005 │ 0.004 │ -0.6 │ 0.5 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ ntl │ 0.401 │ -0.007 │ 0.007 │ -0.8 │ 0.8 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ industry │ 0.291 │ 0 │ 0.004 │ 0 │ 0.4 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ popdens │ 0.270 │ -0.003 │ 0.005 │ -0.4 │ 0.6 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ urban │ 0.218 │ -0.001 │ 0.006 │ -0.1 │ 0.7 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ natural │ 0.212 │ -0.007 │ 0.009 │ -0.8 │ 1.1 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ agricultur │ 0.196 │ 0 │ 0.008 │ 0 │ 1 │
│ e │ │ │ │ │ │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ otherroads │ 0.181 │ -0.004 │ 0.004 │ -0.5 │ 0.5 │
├────────────┼────────────┼─────────┼───────┼──────┼─────┤
│ primaryroa │ 0.095 │ -0.003 │ 0.004 │ -0.4 │ 0.5 │
│ ds │ │ │ │ │ │
└────────────┴────────────┴─────────┴───────┴──────┴─────┘
Spatial evaluation
- Training fraction: 0.75
- Spatial folds: 30
Metric Median MAD Minimum Maximum
r.squared 0.838 0.073 0.591 0.924
The mem() function generates Moran Eigenvector Maps (MEM) from a distance matrix.3
mem1 = mem(distance.matrix = distance_matrix)The package also contains a set of custom plot functions. One example is the plot_response_curves() function, which allows for the visualization of the response curves of the model.
plot_response_curves(rf_spatial_moran)Additional interesting functions allow for tuning the model parameters (rf_tuning()) or comparing several models (rf_compare()). A complete list of this package’s functions is available at https://blasbenito.github.io/spatialRF/reference/index.html.
The final prediction can be made using the predict() function from the terra package.
pred_srf = terra::predict(covariates, rf_spatial_moran)
terra::plot(pred_srf[[1]])The sperrorest (https://doi.org/10.32614/CRAN.package.sperrorest) package is designed for spatial error estimation and variable importance assessment for predictive models. The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.
While the sperrorest package has many functions (including a set of functions for data partitioning), its main function is sperrorest(). It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation. To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.
Let’s do it step by step. First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.
library(sperrorest)
library(ranger)
coordinates = sf::st_coordinates(temperature)
temperature_df = sf::st_drop_geometry(temperature)
temperature_df$x = coordinates[, 1]
temperature_df$y = coordinates[, 2]Second, we need to define the formula for the model and the prediction function.
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo = as.formula(paste(response_name, "~", paste(covariate_names, collapse = " + ")))Third, we need to define the custom prediction function. The sperrorest package works with many model functions, but it requires a custom prediction function to extract the predictions from the model object. In this example, we use the ranger model, so we need to define a custom prediction function that extracts the predictions from the ranger model object. The predict() function from the ranger package returns a list with several elements, so we need to extract the predictions from this list.4
mypred = function(object, newdata) {
predict(object, newdata)$predictions
}Fourth, we can perform the spatial cross-validation using the sperrorest() function. We just need to provide previously prepared data, the formula, the model function, and the prediction function. Moreover, we can also define some additional parameters of the model, such as the number of trees in the ranger model. Finally, the important part is to define the sampling function (smp_fun) and its arguments (smp_args). The sampling function is used to partition the data into training and testing sets: here, we use the partition_kmeans() function to partition the data spatially into folds using k-means clustering of the coordinates.5
# Spatial cross-validation
sp_res = sperrorest(
formula = fo,
data = temperature_df,
coords = c("x", "y"),
model_fun = ranger,
model_args = list(num.trees = 100),
pred_fun = mypred,
smp_fun = partition_kmeans,
smp_args = list(repetition = 1:2, nfold = 3),
progress = FALSE
)The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when importance = TRUE), the benchmark, and the package version.
summary(sp_res$error_rep) mean sd median IQR
train_bias 5.503716e-03 0.0005710437 5.503716e-03 0.0004037888
train_stddev 4.005413e-01 0.0052912326 4.005413e-01 0.0037414665
train_rmse 4.000656e-01 0.0052760852 4.000656e-01 0.0037307557
train_mad 3.269942e-01 0.0160105606 3.269942e-01 0.0113211760
train_median 4.057216e-02 0.0075211112 4.057216e-02 0.0053182287
train_iqr 4.443592e-01 0.0054738763 4.443592e-01 0.0038706151
train_count 3.900000e+02 0.0000000000 3.900000e+02 0.0000000000
test_bias 6.902773e-02 0.0588864491 6.902773e-02 0.0416390075
test_stddev 1.459237e+00 0.0019335902 1.459237e+00 0.0013672547
test_rmse 1.457718e+00 0.0047141446 1.457718e+00 0.0033334036
test_mad 1.416826e+00 0.0454142596 1.416826e+00 0.0321127309
test_median 1.398933e-01 0.0951137382 1.398933e-01 0.0672555693
test_iqr 1.881005e+00 0.0871915974 1.881005e+00 0.0616537698
test_count 1.950000e+02 0.0000000000 1.950000e+02 0.0000000000
We can contrast the obtained results with the non-spatial cross-validation by changing the sampling function to partition_cv().
# Non-spatial cross-validation
nsp_res = sperrorest(
formula = fo,
data = temperature_df,
coords = c("x", "y"),
model_fun = ranger,
model_args = list(num.trees = 100),
pred_fun = mypred,
smp_fun = partition_cv,
smp_args = list(repetition = 1:2, nfold = 3),
progress = FALSE
)To compare both results, we can plot the RMSE values for the training and testing sets of both spatial and non-spatial cross-validation.
library(ggplot2)
# Extract train/test RMSE from spatial CV
sp_train_rmse = sp_res$error_rep$train_rmse
sp_test_rmse = sp_res$error_rep$test_rmse
# Extract train/test RMSE from non-spatial CV
nsp_train_rmse = nsp_res$error_rep$train_rmse
nsp_test_rmse = nsp_res$error_rep$test_rmse
# Build data frame
rmse_df = data.frame(
CV_Type = rep(c("Spatial", "Non-Spatial"), each = 4),
Set = rep(c("Train", "Test"), each = 2),
RMSE = c(sp_train_rmse, sp_test_rmse, nsp_train_rmse, nsp_test_rmse)
)
ggplot(rmse_df, aes(x = CV_Type, y = RMSE, fill = Set)) +
geom_boxplot() +
facet_wrap(~Set) +
labs(title = "RMSE Comparison", x = "CV Method", y = "RMSE")The results show that the estimation using the spatial-cross validation is less optimistic than the non-spatial cross-validation for the test set.
More examples of the package use can be found at https://giscience-fsu.github.io/sperrorest/articles/spatial-modeling-use-case.html/
The blockCV (https://doi.org/10.1111/2041-210X.13107) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation. The package itself does not fit the models.
library(blockCV)Cross-validation strategies separate the data into training and testing sets to evaluate the model’s performance. The blockCV package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.
The block cross-validation is performed using the cv_spatial() function. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern (the selection argument).
sb1 = cv_spatial(x = temperature,
k = 10, # number of folds
size = 300000, # size of the blocks in meters
selection = "random", # random blocks-to-fold
iteration = 50, # find evenly dispersed folds
progress = FALSE,
biomod2 = TRUE)
train test
1 181 14
2 171 24
3 171 24
4 175 20
5 169 26
6 165 30
7 180 15
8 182 13
9 182 13
10 179 16
The result is a list with several components, including the folds list, the folds IDs, the biomod table, the number of folds, the input size, the column name, the blocks, and the records. For example, we can check the structure of the folds list with the str() function.
str(sb1$folds_list)List of 10
$ :List of 2
..$ : int [1:181] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:14] 148 139 142 160 158 147 143 149 146 161 ...
$ :List of 2
..$ : int [1:171] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:24] 60 26 29 36 59 27 25 37 22 40 ...
$ :List of 2
..$ : int [1:171] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:24] 17 176 11 14 16 169 15 12 44 46 ...
$ :List of 2
..$ : int [1:175] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:20] 185 72 159 163 184 71 178 65 183 162 ...
$ :List of 2
..$ : int [1:169] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:26] 114 91 140 154 106 88 141 113 155 111 ...
$ :List of 2
..$ : int [1:165] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:30] 82 53 43 78 61 79 42 76 52 48 ...
$ :List of 2
..$ : int [1:180] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:15] 133 138 134 136 192 195 174 173 172 187 ...
$ :List of 2
..$ : int [1:182] 33 87 99 121 104 117 97 103 100 105 ...
..$ : int [1:13] 129 128 123 130 125 132 127 126 116 120 ...
$ :List of 2
..$ : int [1:182] 87 99 121 104 117 97 103 100 105 84 ...
..$ : int [1:13] 33 193 2 4 8 194 1 10 7 6 ...
$ :List of 2
..$ : int [1:179] 33 60 26 29 36 59 27 25 37 22 ...
..$ : int [1:16] 87 99 121 104 117 97 103 100 105 84 ...
The cv_plot() function additionally allows for the visualization of cross-validation results.
cv_plot(sb1, temperature)Let’s compare the results of the block cross-validation with systematic and checkerboard patterns.
sb2 = cv_spatial(x = temperature,
k = 10,
rows_cols = c(4, 6),
hexagon = FALSE,
selection = "systematic")
train test
1 172 23
2 180 15
3 177 18
4 169 26
5 178 17
6 182 13
7 180 15
8 162 33
9 178 17
10 177 18
cv_plot(sb2, temperature)sb3 = cv_spatial(x = temperature,
k = 10,
size = 300000,
hexagon = FALSE,
selection = "checkerboard")
train test
1 98 97
2 97 98
cv_plot(sb3, temperature)The clustering strategies (cv_cluster()) are used to group the data into clusters based on spatial or environmental similarity. The spatial similarity is based only on the clustering of the spatial coordinates.
set.seed(6)
scv = cv_cluster(x = temperature, k = 10) train test
1 178 17
2 173 22
3 182 13
4 169 26
5 179 16
6 176 19
7 178 17
8 180 15
9 171 24
10 169 26
cv_plot(scv, temperature)The environmental clustering, on the other hand, is based on the clustering of the values of the covariates extracted from the raster data.
set.seed(6)
ecv = cv_cluster(x = temperature, r = covariates, k = 5, scale = TRUE) train test
1 90 105
2 154 41
3 182 13
4 164 31
5 190 5
cv_plot(ecv, temperature)The next cross-validation strategy is buffering LOO (also known as Spatial LOO). It is performed using the cv_buffer() function, which selects a buffer around each point (test point) and uses the points outside the buffer as the testing set.6
bloo = cv_buffer(x = temperature, size = 300000, progress = FALSE) train test
Min. : 97.0 Min. :1
Mean :132.7 Mean :1
Max. :170.0 Max. :1
cv_plot(bloo, temperature, num_plots = c(1, 50, 100))Note that above, we plot only the first, 50th, and 100th points to avoid overplotting.
The last cross-validation strategy implemented in the blockCV package is the Nearest Neighbour Distance Matching (NNDM) LOO. It is performed using the cv_nndm() function, which tries to match the nearest neighbor distance distribution function between the test and training data to the nearest neighbor distance distribution function between the target prediction and training points. Thus, in this base, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.
nncv = cv_nndm(x = temperature,
r = covariates,
size = 300000,
num_sample = 5000,
sampling = "regular",
min_train = 0.1,
plot = TRUE) train test
Min. :192.0 Min. :1
Mean :192.9 Mean :1
Max. :193.0 Max. :1
cv_plot(nncv, temperature, num_plots = c(1, 50, 100))Let’s now use the block cross-validation to fit and evaluate a model.
# define formula
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo = as.formula(paste(response_name, "~", paste(covariate_names, collapse = " + ")))
# extract the folds
folds = sb1$folds_list
model_rmse = data.frame(fold = seq_along(folds), rmse = rep(NA, length(folds)))
for(k in seq_along(folds)){
trainSet = unlist(folds[[k]][1]) # training set indices; first element
testSet = unlist(folds[[k]][2]) # testing set indices; second element
rf = ranger(fo, temperature_df[trainSet, ], num.trees = 100) # model fitting on training set
pred = predict(rf, temperature_df[testSet, ])$predictions # predict the test set
model_rmse[k, "rmse"] = sqrt(mean((temperature_df[testSet, response_name] - pred)^2)) # calculate RMSE
}
model_rmse fold rmse
1 1 0.5555220
2 2 1.1668933
3 3 0.8910963
4 4 0.8444776
5 5 0.8323935
6 6 1.3915134
7 7 0.8645923
8 8 0.8251324
9 9 1.1584016
10 10 1.0197603
The blockCV package also provides functions for checking the similarity between the folds (cv_similarity()) and estimating the effective range of spatial autocorrelation (cv_spatial_autocor()). The first function is used to check the similarity between the folds in the cross-validation.
cv_similarity(cv = sb1, x = temperature, r = covariates, progress = FALSE)The second function is used to estimate the effective range of spatial autocorrelation of all input raster layers or the response data – its role is to help to determine the size of the blocks in the block cross-validation.
cv_spatial_autocor(r = covariates, num_sample = 5000, progress = FALSE)[1] "cv_spatial_autocor"
More examples of the package’s use can be found at https://cran.r-project.org/web/packages/blockCV/vignettes/tutorial_2.html.
https://jamiemkass.github.io/ENMeval/articles/ENMeval-2.0-vignette.html
Out-of-scope: a focus on the occurrence records (ecological niche models/species distribution models)
https://github.com/e-sensing/sits
Out-of-scope: a focus on spatiotemporal data cubes
Quoting the authors “RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate for dependence in linear models.”↩︎
See ?rf_spatial for more details. Also, the 10th method is "hengl" directly following the approach by Hengl et al. (2018).↩︎
mem_multithreshold() function allows for generating MEMs for multiple distance thresholds.↩︎
More information on the custom prediction functions is at https://cran.r-project.org/web/packages/sperrorest/vignettes/custom-pred-and-model-functions.html.↩︎
There are several other partition functions available in the package, including partition_disc(), partition_tiles(), and partition_cv().↩︎
This approach is a form of leave-one-out cross-validation.↩︎